Semibulk RNA-seq analysis as a convenient method for measuring gene expression statuses in a local cellular environment

When biological interpretation of data obtained from single-cell RNA sequencing (scRNA-seq) analysis is attempted, additional information on the location of the single cells, the behavior of the surrounding cells, and the microenvironment they generate would be very important. We developed an inexpensive, high-throughput application that preserves spatial organization, named “semibulk RNA-seq” (sbRNA-seq). We utilized a microfluidic device specifically designed for the experiments to encapsulate both a barcoded bead and a cell aggregate (a semibulk) into a single droplet. Using sbRNA-seq, we first analyzed mouse kidney specimens. In the mouse model, we could associate the pathological information with the gene expression information. We validated the results using spatial transcriptome analysis and found them highly consistent. When we applied the sbRNA-seq analysis to human breast cancer specimens, we identified spatial interactions between a particular population of immune cells and a population of cancer-associated fibroblasts, which were not precisely represented by the single-cell analysis alone. Semibulk analysis may provide a convenient and versatile method, compared to a standard spatial transcriptome sequencing platform, to associate spatial information with transcriptome information.

First, marker genes for each cell type were detected using the FindAllMarkers function of Seurat with the following options: logfc.threshold = 1 and min.pct = 0.8. The cell-type fractions were then estimated using SPOTlight v1.0.1 following the instruction manual with the cell. For RCTD, cell-type fractions were estimated using RCTD of spacexr v2.0.1 following the manual with some modifications. To create a reference object including the entire normalized single-cell data, the Reference function was used with the following options: require_int = F and n_max_cells = 20000. The RCTD option for the minimum number of cells required per cell type was set to 20.
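To make the two marker-selection thresholds above concrete, the following is a minimal Python sketch of the kind of pre-filter they imply. It is not Seurat's implementation: the function name `candidate_markers`, the fold-change definition (log2 ratio of mean back-transformed expression with a pseudocount of 1), and the restriction to up-regulated genes are all assumptions made for illustration, and no statistical test is performed.

```python
import numpy as np

def candidate_markers(expr, labels, target, logfc_threshold=1.0, min_pct=0.8):
    """Return indices of genes passing a detection-rate and fold-change filter.

    expr   : (n_cells, n_genes) array of log1p-normalized expression (assumed).
    labels : (n_cells,) array of cell-type labels.
    target : the cell type for which candidate markers are selected.
    """
    expr = np.asarray(expr, dtype=float)
    in_group = np.asarray(labels) == target
    x_in, x_out = expr[in_group], expr[~in_group]

    # Fraction of cells in each group with detectable expression of each gene.
    pct_in = (x_in > 0).mean(axis=0)
    pct_out = (x_out > 0).mean(axis=0)

    # Assumed fold-change definition: log2 ratio of mean back-transformed
    # expression, with a pseudocount of 1 to avoid division by zero.
    log2fc = (np.log2(np.expm1(x_in).mean(axis=0) + 1)
              - np.log2(np.expm1(x_out).mean(axis=0) + 1))

    # Keep genes detected in at least min_pct of cells in either group and
    # up-regulated in the target group by at least logfc_threshold.
    keep = (np.maximum(pct_in, pct_out) >= min_pct) & (log2fc >= logfc_threshold)
    return np.flatnonzero(keep)
```

With min_pct = 0.8, only genes detected in most cells of at least one group survive, which favors robustly expressed markers over sparsely detected ones.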
We found that Seurat and RCTD showed results similar to those of CIBERSORTx (Fig. 3c and Supplementary Fig. 4). For SPOTlight, T lymphocytes, VSMCs, intercalated cells, principal cells, and cells from the loop of Henle (LoH) were predicted as major components, whereas proximal tubular cells were predicted as relatively minor components in the kidney. Previous studies have shown that proximal tubular cells are the most abundant cell type in the mouse kidney 5,6. Moreover, proximal tubular cells and the LoH are present at different locations, the cortex and medulla, respectively 27. However, these cell types were frequently predicted in the same semibulks in the SPOTlight results. These discrepancies were not observed in the CIBERSORTx, Seurat, or RCTD results. Therefore, at least in our case, CIBERSORTx, Seurat, and RCTD are more robust tools for deconvolution than SPOTlight.
Supplementary Figure 5. Correlation between the gene expression of pseudo-semibulks and real semibulk data.
Boxplot of the Pearson correlation coefficients between each real semibulk and its corresponding pseudo-semibulk. The expression patterns of the pseudo-semibulks were constructed from the cellular compositions predicted by CIBERSORTx 25, Seurat 24, SPOTlight 22, or RCTD 23 and the averaged gene expression of scRNA-seq for each cell type.
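The pseudo-semibulk construction described in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `pseudo_semibulk` and `rowwise_pearson` are hypothetical, the predicted fractions are assumed to be arranged as a semibulks-by-cell-types matrix, and the scRNA-seq averages as a cell-types-by-genes matrix on the same gene order as the real semibulk data.

```python
import numpy as np

def pseudo_semibulk(fractions, celltype_means):
    """Mix per-cell-type mean expression profiles by predicted fractions.

    fractions      : (n_semibulks, n_celltypes) predicted cell-type fractions.
    celltype_means : (n_celltypes, n_genes) mean scRNA-seq expression per type.
    Returns a (n_semibulks, n_genes) matrix of pseudo-semibulk expression.
    """
    fractions = np.asarray(fractions, dtype=float)
    celltype_means = np.asarray(celltype_means, dtype=float)
    return fractions @ celltype_means

def rowwise_pearson(a, b):
    """Pearson correlation between matching rows of a and b.

    Rows with zero variance give NaN, because the correlation is undefined.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    return num / den
```

Calling `rowwise_pearson(real, pseudo_semibulk(fractions, celltype_means))` once per deconvolution tool would yield one coefficient per semibulk, which is the quantity summarized in a boxplot of this kind.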
To further validate the prediction of the cellular components, we generated pseudo-semibulks based on the cellular components predicted by CIBERSORTx, Seurat, SPOTlight, and RCTD (Supplementary Fig. 5). When we compared the correlation between these pseudo-semibulks and the real semibulks, we found that the average of the correlation coefficients was 0.57 in CIBERSORTx, Seurat, and [...]
[...] UMI counts roughly correlate with cell density or cellular mRNA content 34. To roughly estimate cell density at each spot, we compared UMI counts among spots (Supplementary Fig. 6a). Although the spots of cluster 7, mainly derived from damaged thin tissue fragments, showed lower UMI counts than the other clusters, the remaining clusters showed comparable UMI counts. Furthermore, we manually [...]
We attempted to visualize the semibulk and Visium datasets in the same UMAP plot. After reducing the differences arising from the two methodologies by employing the integration procedure of Seurat, we plotted the respective datasets on the same plane. The data showed a general overlap. Nevertheless, on closer inspection, several regions did not overlap exactly. For example, the semibulk data showed a higher proportion of data points for glomerular cells (Cluster 0) and S1 proximal tubules (Cluster 6) and a lower proportion of data points for S3 proximal tubules (Cluster 2) and a subset of distal tubules (Cluster 5) compared with the Visium data.
Glomerular cells and S1 proximal tubules are located in the cortex 27. On the other hand, S3 proximal tubules and the subset of distal tubules are located in the outer medulla 27.
Together, these observations indicate a higher representation of the medulla in the Visium data.
The plots for the saturation curves of sbRNA-seq for mouse kidney and breast cancer specimens (Cases A and B). The reads of sbRNA-seq were randomly sampled, and the median numbers of genes and UMIs per semibulk were estimated.
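The read-subsampling procedure behind such a saturation curve can be sketched as follows. This is a minimal Python illustration, not the pipeline used in the study: the function name `saturation_curve` and the input layout (one integer-coded barcode, gene, and UMI per read) are assumptions, and semibulks that receive no reads at a given sampling fraction are left out of the medians.

```python
import numpy as np

def saturation_curve(barcodes, genes, umis, fractions, seed=0):
    """Median genes and UMIs per semibulk after random read subsampling.

    barcodes, genes, umis : equal-length integer arrays; read i is the triple
                            (barcodes[i], genes[i], umis[i]).
    fractions             : iterable of sampling fractions in [0, 1].
    Returns a list of (fraction, median_genes, median_umis) tuples.
    """
    rng = np.random.default_rng(seed)
    reads = np.column_stack([barcodes, genes, umis])
    n_reads = reads.shape[0]
    curve = []
    for frac in fractions:
        n_keep = int(round(frac * n_reads))
        if n_keep == 0:
            curve.append((frac, 0.0, 0.0))
            continue
        # Sample reads without replacement.
        idx = rng.choice(n_reads, size=n_keep, replace=False)
        # Collapse duplicate reads: one row per (barcode, gene, UMI) molecule.
        molecules = np.unique(reads[idx], axis=0)
        # UMIs per semibulk = number of distinct molecules per barcode.
        _, umi_counts = np.unique(molecules[:, 0], return_counts=True)
        # Genes per semibulk = number of distinct (barcode, gene) pairs.
        barcode_gene = np.unique(molecules[:, :2], axis=0)
        _, gene_counts = np.unique(barcode_gene[:, 0], return_counts=True)
        curve.append((frac, float(np.median(gene_counts)),
                      float(np.median(umi_counts))))
    return curve
```

A curve whose medians are still rising at a fraction of 1.0 suggests that the library has not been sequenced to saturation.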
For clinical samples, because sample quality was not always high, the numbers of UMIs and genes were not always high (Supplementary Fig. 16 and Supplementary Table 5), which also occurs when a commercial single-cell platform is used. However, for the mouse kidney, the median number of detected genes per semibulk was 1,700, and the median number of UMIs per semibulk was 3,500 (Supplementary Table 4), which is comparable with common single-cell datasets. For example, in public scRNA-seq datasets for human peripheral blood mononuclear cells released by 10x Genomics (v3.1 Chemistry), the median number of genes per cell was 1,800-2,200, and the median UMI count per cell was 5,700-7,700 55,56. Therefore, we believe that the numbers for sbRNA-seq are not surprising at the given sequencing depth. Moreover, note that the sequencing depth of the semibulk datasets was not saturated in the mouse kidney data (Supplementary Fig. 16). Therefore, we expect that these numbers could be somewhat improved by additional sequencing.